Private-Sector AI Indicators – Center for Security and Emerging Technology (CSET), Georgetown University https://eto.tech/dataset-docs/private-sector-ai-indicators/ Zenodo Record: https://zenodo.org/records/14518311
The source links directly to the official documentation page for the Private-Sector AI Indicators dataset, which is produced by the Center for Security and Emerging Technology (CSET) at Georgetown University. Because the dataset is published by a recognized academic research institution and distributed through Zenodo with a Digital Object Identifier (DOI), it is transparent and traceable. This indicates that the data comes from a credible research organization rather than an anonymous or synthetic source. The dataset reflects real-world corporate AI activity and is based on systematically collected information such as AI publications and patent filings. Since the documentation clearly explains how the dataset was constructed and what it measures, we can confirm that this is a real and reliable dataset suitable for analysis.
The dataset represents a collection of company-level indicators measuring AI activity in the private sector. Each row is a specific company, and the columns include quantitative measures such as AI-related research publications, AI-related patent filings, and workforce-related indicators. These variables capture different dimensions of AI innovation and technological engagement within companies. The information contained in this dataset is relevant to our project because our project aims to analyze patterns of AI activity across companies. This dataset provides measurable indicators that allow us to compare companies based on their AI research output and innovation intensity. By using this dataset, we can identify differences in AI development across firms and industries.
We plan to join the country-level aggregates from this dataset with our second dataset. Since the two datasets may use slightly different naming conventions for countries, we will standardize the country names before merging. We will convert all country names to a consistent format and ensure that spelling and capitalization match. We plan to do an inner join so that only countries appearing in both datasets are retained in the final merged dataset.
Rows: 691
Columns: 63
$ Name <chr> "Acc…
$ ID <int> 803,…
$ Country <chr> "Ire…
$ Website <chr> "htt…
$ Groups <chr> "S&P…
$ Aggregated.subsidiaries <chr> "", …
$ Region <chr> "Eur…
$ Stage <chr> "Mat…
$ Sector <chr> "Sof…
$ Description <chr> "Acc…
$ Description.source <chr> "cru…
$ Description.link <chr> "htt…
$ Description.date <chr> "202…
$ Publications..AI.publications <int> 249,…
$ Publications..Recent.AI.publication.growth <dbl> 28.9…
$ Publications..AI.publication.percentage <dbl> 19.2…
$ Publications..AI.publications.in.top.conferences <int> 24, …
$ Publications..Citations.to.AI.research <int> 4557…
$ Publications..CV.publications <int> 34, …
$ Publications..NLP.publications <int> 60, …
$ Publications..Robotics.publications <int> 18, …
$ Publications..AI.safety.publications <int> 8, 9…
$ Publications..Large.language.model.publications <int> 8, 2…
$ Publications..Total.publications <int> 1297…
$ Patents..AI.patents <int> 494,…
$ Patents..AI.patents..recent.growth <dbl> 49.1…
$ Patents..AI.patent.percentage <dbl> 40.1…
$ Patents..Granted.AI.patents <int> 248,…
$ Patents..Total.patents <int> 1231…
$ Patents..AI.use.cases..Agriculture <int> 3, 0…
$ Patents..AI.use.cases..Banking.and.finance <int> 38, …
$ Patents..AI.use.cases..Business <int> 157,…
$ Patents..AI.use.cases..Computing.in.government <int> 0, 1…
$ Patents..AI.use.cases..Document.management.and.publishing <int> 0, 0…
$ Patents..AI.use.cases..Education <int> 5, 0…
$ Patents..AI.use.cases..Energy <int> 4, 0…
$ Patents..AI.use.cases..Entertainment <int> 1, 1…
$ Patents..AI.use.cases..Industry.and.manufacturing <int> 11, …
$ Patents..AI.use.cases..Life.sciences <int> 34, …
$ Patents..AI.use.cases..Military <int> 0, 0…
$ Patents..AI.use.cases..Nanotechnology <int> 0, 0…
$ Patents..AI.use.cases..Networking <int> 5, 0…
$ Patents..AI.use.cases..Personal.devices.and.computing <int> 361,…
$ Patents..AI.use.cases..Physical.sciences.and.engineering <int> 3, 0…
$ Patents..AI.use.cases..Security <int> 49, …
$ Patents..AI.use.cases..Semiconductors <int> 1, 0…
$ Patents..AI.use.cases..Telecommunications <int> 100,…
$ Patents..AI.use.cases..Transportation <int> 9, 0…
$ Patents..AI.applications.and.techniques..Analytics.and.algorithms <int> 30, …
$ Patents..AI.applications.and.techniques..Computer.vision <int> 69, …
$ Patents..AI.applications.and.techniques..Control <int> 38, …
$ Patents..AI.applications.and.techniques..Distributed.AI <int> 8, 1…
$ Patents..AI.applications.and.techniques..Knowledge.representation <int> 102,…
$ Patents..AI.applications.and.techniques..Language.processing <int> 33, …
$ Patents..AI.applications.and.techniques..Measuring.and.testing <int> 13, …
$ Patents..AI.applications.and.techniques..Planning.and.scheduling <int> 129,…
$ Patents..AI.applications.and.techniques..Robotics <int> 0, 0…
$ Patents..AI.applications.and.techniques..Speech.processing <int> 23, …
$ Workforce..AI.workers <int> 1422…
$ Workforce..Tech.Team.1.workers <int> 1705…
$ City <chr> "Dub…
$ State.province <chr> "Dub…
$ PARAT.link <chr> "htt…
This dataset contains 691 rows and 63 columns covering AI publications, AI patents, workforce indicators, and related company attributes. Each row represents a company, and the data includes variables that allow us to calculate average AI publication intensity at the country level. These columns are directly related to our project because they allow us to measure how strongly companies in different countries engage in AI research.
From the analyzed data, we observed that some companies may have missing country information or incomplete publication percentage values. Additionally, country names may appear in slightly different formats, such as USA vs United States. However, these issues can be resolved through standardization and aggregation. Overall, there are no major structural problems in the dataset that would prevent it from being used in our analysis.
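The standardization and aggregation described above can be sketched as follows. This is a minimal Python sketch under stated assumptions: the alias table `COUNTRY_ALIASES` and the helper names are hypothetical, and the sample rows are invented for illustration. Only the column names `Country` and `Publications..AI.publication.percentage` come from the dataset.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical alias table; extend it after inspecting the real Country values.
COUNTRY_ALIASES = {
    "usa": "United States",
    "us": "United States",
    "united states of america": "United States",
    "uk": "United Kingdom",
}


def standardize_country(raw):
    """Return a canonical country name, or None when the value is missing."""
    if raw is None or not raw.strip():
        return None
    collapsed = " ".join(raw.split())  # trim and collapse repeated spaces
    return COUNTRY_ALIASES.get(collapsed.lower(), collapsed)


def mean_ai_share_by_country(rows):
    """Average AI publication percentage per country, skipping incomplete rows."""
    shares = defaultdict(list)
    for row in rows:
        country = standardize_country(row.get("Country"))
        share = row.get("Publications..AI.publication.percentage")
        if country is None or share is None:
            continue  # missing country or missing percentage
        shares[country].append(share)
    return {country: mean(values) for country, values in shares.items()}


# Invented rows, for illustration only.
sample = [
    {"Country": "USA", "Publications..AI.publication.percentage": 20.0},
    {"Country": "United States", "Publications..AI.publication.percentage": 10.0},
    {"Country": "", "Publications..AI.publication.percentage": 5.0},
    {"Country": "Ireland", "Publications..AI.publication.percentage": None},
]
result = mean_ai_share_by_country(sample)
```

In this sketch the two spellings of the United States collapse into one group, while the rows with a missing country or a missing percentage are dropped before averaging.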
Source Link: EU Industrial R&D Investment Scoreboard – World 2000 Edition European Commission, Joint Research Centre (JRC) https://iri.jrc.ec.europa.eu/scoreboard/2025-eu-industrial-rd-investment-scoreboard
The second dataset used in our project is the EU Industrial R&D Investment Scoreboard (World 2000). It is published by the European Commission through the Joint Research Centre (JRC) and the Directorate-General for Research and Innovation. This dataset reports R&D investment data for the world’s top 2000 corporate R&D investors. The data are compiled from audited company annual financial reports, so the R&D expenditure values come directly from official company financial statements. The 2025 edition reflects company data from fiscal year 2024. Because the dataset is produced by a government research body and based on audited financial records, it qualifies as real and credible data. It measures actual corporate investment behavior rather than simulated or synthetic values.
This dataset contains company-level R&D investment data from different industries and countries. It focuses on overall research and development spending rather than AI-specific research. The dataset includes variables such as total R&D expenditure, R&D intensity, net sales, and number of employees. These variables show how much companies invest in innovation and how large they are. This information is relevant to our project because we can compare company R&D investment with AI research output from our first dataset. This helps us examine whether companies that spend more on R&D also produce more AI research.
We plan to join this dataset with our first dataset using company names. The key column in the Scoreboard is Company, and the key column in our first dataset is also the company name. Company names may not match exactly due to punctuation, abbreviations, or legal suffixes. We will standardize names by converting to lowercase, removing extra spaces, and removing punctuation. We will then perform an inner join to keep only companies that appear in both datasets.
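The name standardization and inner join described above can be sketched as follows. This is a minimal Python sketch, not the pipeline that produced the outputs in this report. The suffix list `LEGAL_SUFFIXES` and both function names are assumptions, and stripping legal suffixes is an added step beyond the three listed in the text. The company names, R&D figures, and patent counts for Apple and Volkswagen are taken from the merged output in this report; the unmatched row is invented.

```python
import re

# Assumed list of legal suffixes; the real list should come from inspecting both datasets.
LEGAL_SUFFIXES = {"inc", "corp", "corporation", "co", "ltd", "plc", "ag", "sa", "nv", "se"}


def clean_company_name(name):
    """Lowercase, drop punctuation, drop trailing legal suffixes, collapse spaces."""
    text = name.lower()
    text = re.sub(r"[^\w\s&]", " ", text)  # replace punctuation with a space, keep '&'
    tokens = text.split()                  # splitting also removes extra spaces
    while len(tokens) > 1 and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def inner_join_on_name(scoreboard_rows, cset_rows):
    """Keep only companies present in both inputs, matched on the cleaned name."""
    # If two CSET rows clean to the same name, the later one wins in this sketch.
    cset_by_name = {clean_company_name(r["Name"]): r for r in cset_rows}
    merged = []
    for row in scoreboard_rows:
        match = cset_by_name.get(clean_company_name(row["Company"]))
        if match is not None:
            merged.append({**row, **match})  # on shared keys, the CSET value is kept
    return merged


scoreboard = [
    {"Company": "APPLE INC.", "R&D (€million)": 30195.399},
    {"Company": "VOLKSWAGEN AG", "R&D (€million)": 20998.0},
    {"Company": "UNMATCHED EXAMPLE CO", "R&D (€million)": 1.0},  # invented row
]
cset = [
    {"Name": "Apple", "Patents..AI.patents": 787},
    {"Name": "Volkswagen", "Patents..AI.patents": 36},
]
merged = inner_join_on_name(scoreboard, cset)
```

Removing punctuation before stripping suffixes avoids leftovers such as a stray trailing period in the cleaned key. Exact matching on cleaned names will still miss some pairs (for example, names that differ by more than a suffix), so the unmatched rows are worth reviewing by hand.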
Rows: 2,000
Columns: 24
$ Year <dbl> 2024, 2024, 2024, 2024, 20…
$ `World rank` <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9,…
$ Company <chr> "AMAZON.COM, INC.", "ALPHA…
$ Country <chr> "US", "US", "US", "US", "U…
$ Region <chr> "US", "US", "US", "US", "U…
$ `ISO3 Code` <chr> "USA", "USA", "USA", "USA"…
$ `ICB3 short` <chr> "ICT services", "ICT servi…
$ `ICB3 long` <chr> "Software & Computer Servi…
$ `R&D (€million)` <dbl> 65318.282, 46131.485, 4198…
$ `R&D one-year growth (%)` <dbl> 10.327797, 9.647899, 19.54…
$ `Net sales (€million)` <dbl> 614071.61, 336912.12, 1583…
$ `Net sales one-year growth (%)` <dbl> 10.9908922, 13.8662433, 21…
$ `R&D intensity (%)` <dbl> 10.636916, 13.692439, 26.5…
$ `Capex (€million)` <dbl> 79891.231, 50567.908, 3586…
$ `Capex one-year growth (%)` <dbl> 57.406740, 62.894174, 36.6…
$ `Capex intensity (%)` <dbl> 13.010084, 15.009228, 22.6…
$ `R&D-to-capex ratio (%)` <dbl> 81.75901, 91.22680, 117.07…
$ `Operating profit (€million)` <dbl> 66024.641, 108181.731, 664…
$ `Operating profit one-year growth (%)` <dbl> 86.131011, 33.332542, 59.3…
$ `Profitability (%)` <dbl> 10.751945, 32.109777, 41.9…
$ `Market capitalisation (€million)` <dbl> 2220511.39, 1064664.94, 12…
$ `Market capitalisation one-year growth (%)` <dbl> 39.0005792, 34.6378175, 63…
$ Employees <dbl> 1556000, 183323, 74067, 22…
$ `Employees one-year growth (%)` <dbl> 2.0327869, 0.4498581, 10.0…
This dataset contains 2000 rows and 24 columns. Each row represents one company among the top global R&D investors. The dataset includes financial and performance variables such as R&D spending, net sales, and employee counts. The R&D variable is measured in millions of euros. This allows us to directly compare the scale of company investment. The dataset also includes the country of headquarters, which can be used for additional aggregation if needed.
One issue in our combined dataset is the difference in measurement units between the two data sources. Dataset 1 measures AI output using patent counts and publication percentages. Dataset 2 measures total R&D investment in millions of euros and R&D intensity percentages. Because the units are different, we cannot directly compare raw values across the two datasets. AI patents represent output, while R&D spending represents input. This creates a scale and dimension difference. To address this issue, we created a standardized efficiency variable. We calculated AI patents per €100 million of R&D spending. This allows us to compare companies more fairly and control for differences in company size and investment scale.
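The efficiency variable described above can be written out as a small worked example. This is a sketch in Python; the function name `ai_patents_per_100m` is hypothetical, and returning `None` for missing or non-positive R&D is an assumption about how invalid rows are handled. The Microsoft and Apple inputs are the values shown in the merged output in this report.

```python
def ai_patents_per_100m(ai_patents, rd_million_eur):
    """AI patents per EUR 100 million of R&D; None when R&D is missing or non-positive."""
    if ai_patents is None or rd_million_eur is None or rd_million_eur <= 0:
        return None
    rd_in_100m_units = rd_million_eur / 100.0  # EUR million -> units of EUR 100 million
    return ai_patents / rd_in_100m_units


# Microsoft: 4146 AI patents, EUR 31271.537 million of R&D
microsoft = ai_patents_per_100m(4146, 31271.537)  # roughly 13.3 patents per EUR 100M
# Apple: 787 AI patents, EUR 30195.399 million of R&D
apple = ai_patents_per_100m(787, 30195.399)       # roughly 2.6 patents per EUR 100M
```

The two companies spend similar amounts on R&D, so the ratio makes the difference in AI patent output visible on a common scale.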
[1] "Year"
[2] "World rank"
[3] "Company"
[4] "Country"
[5] "Region"
[6] "ISO3 Code"
[7] "ICB3 short"
[8] "ICB3 long"
[9] "R&D (€million)"
[10] "R&D one-year growth (%)"
[11] "Net sales (€million)"
[12] "Net sales one-year growth (%)"
[13] "R&D intensity (%)"
[14] "Capex (€million)"
[15] "Capex one-year growth (%)"
[16] "Capex intensity (%)"
[17] "R&D-to-capex ratio (%)"
[18] "Operating profit (€million)"
[19] "Operating profit one-year growth (%)"
[20] "Profitability (%)"
[21] "Market capitalisation (€million)"
[22] "Market capitalisation one-year growth (%)"
[23] "Employees"
[24] "Employees one-year growth (%)"
[1] 364
# A tibble: 6 × 10
Company `R&D (€million)` `R&D intensity (%)` `ICB3 short` clean_name.x core_id
<chr> <dbl> <dbl> <chr> <chr> <int>
1 MICROS… 31272. 11.5 ICT services MICROSOFT 5
2 APPLE … 30195. 8.02 ICT produce… APPLE . 9
3 VOLKSW… 20998 6.47 Automobiles… VOLKSWAGEN … 425
4 JOHNSO… 16587. 19.4 Health indu… JOHNSON & J… 98
5 INTEL … 15926. 31.2 ICT produce… INTEL 13
6 NVIDIA… 12430. 9.90 ICT produce… NVIDIA 31
# ℹ 4 more variables: Name <chr>, Country <chr>, Patents..AI.patents <int>,
# clean_name.y <chr>
[1] "Company" "R&D (€million)" "R&D intensity (%)"
[4] "ICB3 short" "clean_name.x" "core_id"
[7] "Name" "Country" "Patents..AI.patents"
[10] "clean_name.y"
[1] 364
Rows: 364
Columns: 10
$ Company <chr> "MICROSOFT CORPORATION", "APPLE INC.", "VOLKSWAGEN…
$ `R&D (€million)` <dbl> 31271.537, 30195.399, 20998.000, 16586.774, 15926.…
$ `R&D intensity (%)` <dbl> 11.531854, 8.022300, 6.467769, 19.400817, 31.15948…
$ `ICB3 short` <chr> "ICT services", "ICT producers", "Automobiles & Pa…
$ clean_name.x <chr> "MICROSOFT", "APPLE .", "VOLKSWAGEN AG", "JOHNSON …
$ core_id <int> 5, 9, 425, 98, 13, 31, 131, 165, 150, 7, 233, 69, …
$ Name <chr> "Microsoft", "Apple", "Volkswagen", "Johnson & Joh…
$ Country <chr> "United States", "United States", "Germany", "Unit…
$ Patents..AI.patents <int> 4146, 787, 36, 275, 2191, 1457, 24, 7, 13, 441, 24…
$ clean_name.y <chr> "MICROSOFT", "APPLE", "VOLKSWAGEN", "JOHNSON & JOH…
# A tibble: 10 × 10
Company `R&D (€million)` `R&D intensity (%)` `ICB3 short` clean_name.x
<chr> <dbl> <dbl> <chr> <chr>
1 MICROSOFT COR… 31272. 11.5 ICT services MICROSOFT
2 APPLE INC. 30195. 8.02 ICT produce… APPLE .
3 VOLKSWAGEN AG 20998 6.47 Automobiles… VOLKSWAGEN …
4 JOHNSON & JOH… 16587. 19.4 Health indu… JOHNSON & J…
5 INTEL CORP 15926. 31.2 ICT produce… INTEL
6 NVIDIA CORP 12430. 9.90 ICT produce… NVIDIA
7 ASTRAZENECA P… 12033. 23.1 Health indu… ASTRAZENECA
8 ELI LILLY AND… 10579. 24.4 Health indu… ELI LILLY A…
9 PFIZER INC 10336. 16.9 Health indu… PFIZER
10 ORACLE CORP 9491. 17.2 ICT services ORACLE
# ℹ 5 more variables: core_id <int>, Name <chr>, Country <chr>,
# Patents..AI.patents <int>, clean_name.y <chr>
Source: Center for Security and Emerging Technology (CSET) https://eto.tech/dataset-docs/private-sector-ai-indicators/
This dataset contains company-level AI research data. It measures AI-related publication activity. The data are collected from research publication databases and classified using AI topic methods.
The most important variables are:
Name – Company name
Country – Headquarters country
Publications..AI.publication.percentage – Percentage of publications related to AI
The publication percentage measures AI research intensity. This is the key outcome variable in our project.
Source: European Commission, Joint Research Centre https://iri.jrc.ec.europa.eu/scoreboard/2025-eu-industrial-rd-investment-scoreboard
This dataset contains company-level R&D investment data. The data come from audited company annual financial reports. The European Commission compiles and publishes the dataset.
The most important variables are:
Company – Company name
Country – Headquarters country
R&D (€million) – Total R&D spending
R&D intensity (%) – R&D spending divided by sales
Employees – Company size
The R&D spending variable measures company investment in innovation. We use it as a proxy for AI product development effort.
Our dataset may have several limitations. First, both datasets mainly include large companies. Small firms and startups are not fully represented. This creates selection bias. Second, the Scoreboard focuses on top global R&D investors. Companies from developing countries may be underrepresented. Third, AI publication data only measure published research. Companies may conduct private AI research that is not included.
The datasets do not contain personal data. This reduces privacy risk. The data sources are publicly documented, which supports transparency and accountability. However, the data do not measure social impact, worker outcomes, or fairness of AI systems. Therefore, the dataset does not fully reflect all human rights principles. Our analysis focuses only on corporate investment and research patterns. It should not be interpreted as a complete measure of ethical or societal impact of AI development.
First, we cleaned the numeric columns (R&D spending, R&D intensity, AI patents) by converting them to numeric type. This prevents plotting and calculations from failing due to hidden text formatting from Excel or CSV imports. Second, we removed rows with missing or invalid values (e.g., missing R&D, missing patents, non-positive R&D). These rows cannot be used to compute “efficiency” and would distort comparisons.
Third, we created a new categorical column for R&D intensity group (Low / Medium / High). This makes it easier to compare patterns across investment levels rather than only across continuous values. Fourth, we created a new numerical column for AI patenting efficiency (AI patents per €100M R&D). This directly measures “output per input,” which matches our research question about efficiency.
Fifth, we created a summary dataframe by sector (average efficiency and typical R&D intensity). This helps us see whether certain industries systematically perform differently. Finally, we made a graph to visually explore the relationship between R&D intensity and AI patenting efficiency.
# A tibble: 11 × 5
Sector n_companies avg_RD_intensity avg_efficiency_1B median_efficiency_1B
<chr> <int> <dbl> <dbl> <dbl>
1 Construc… 6 1.83 7638. 288.
2 ICT serv… 55 17.2 3839. 61.0
3 Industri… 33 3.21 2880. 139.
4 Others 53 5.80 2441. 25.7
5 Chemicals 22 3.29 2320. 16.9
6 Financial 7 4.20 1105. 0
7 Automobi… 20 5.25 742. 20.7
8 ICT prod… 89 12.4 596. 76.7
9 Health i… 47 14.1 440. 15.6
10 Energy 12 1.41 416. 51.8
11 Aerospac… 12 4.62 410. 24.6
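The cleaning and summary steps described before the table can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the code that produced the table. The helper names are hypothetical, the 5% and 15% cut-offs for the intensity groups are placeholders because the report does not state its thresholds, and the scaling uses the €100M unit given in the text. The Microsoft and Apple rows use values from the merged output; the text formatting on the Microsoft row and the two invalid rows are invented.

```python
from collections import defaultdict
from statistics import mean, median


def to_number(value):
    """Parse a value that may arrive as text (e.g. '1,234.5'); None if unparseable."""
    if value is None:
        return None
    try:
        return float(str(value).replace(",", "").strip())
    except ValueError:
        return None


def intensity_group(rd_intensity_pct, low_cut=5.0, high_cut=15.0):
    """Bucket R&D intensity; the default cut-offs are placeholders, not the report's."""
    if rd_intensity_pct < low_cut:
        return "Low"
    if rd_intensity_pct < high_cut:
        return "Medium"
    return "High"


def prepare(rows):
    """Convert to numeric, drop unusable rows, add group and efficiency columns."""
    cleaned = []
    for row in rows:
        rd = to_number(row.get("R&D (€million)"))
        intensity = to_number(row.get("R&D intensity (%)"))
        patents = to_number(row.get("Patents..AI.patents"))
        if rd is None or intensity is None or patents is None or rd <= 0:
            continue  # efficiency cannot be computed for this row
        cleaned.append({
            "Sector": row["ICB3 short"],
            "rd_intensity": intensity,
            "intensity_group": intensity_group(intensity),
            "efficiency": patents / (rd / 100.0),  # AI patents per EUR 100M of R&D
        })
    return cleaned


def summarize_by_sector(cleaned):
    """Company count, mean R&D intensity, and mean/median efficiency per sector."""
    by_sector = defaultdict(list)
    for row in cleaned:
        by_sector[row["Sector"]].append(row)
    return {
        sector: {
            "n_companies": len(group),
            "avg_RD_intensity": mean(r["rd_intensity"] for r in group),
            "avg_efficiency": mean(r["efficiency"] for r in group),
            "median_efficiency": median(r["efficiency"] for r in group),
        }
        for sector, group in by_sector.items()
    }


rows = [
    {"ICB3 short": "ICT services", "R&D (€million)": "31,271.537",
     "R&D intensity (%)": "11.531854", "Patents..AI.patents": "4146"},
    {"ICB3 short": "ICT producers", "R&D (€million)": 30195.399,
     "R&D intensity (%)": 8.0223, "Patents..AI.patents": 787},
    {"ICB3 short": "ICT services", "R&D (€million)": "n/a",      # invented, dropped
     "R&D intensity (%)": 10.0, "Patents..AI.patents": 5},
    {"ICB3 short": "Energy", "R&D (€million)": 0,                # invented, dropped
     "R&D intensity (%)": 1.0, "Patents..AI.patents": 3},
]
cleaned = prepare(rows)
summary = summarize_by_sector(cleaned)
```

Reporting the median next to the mean is useful here because a few companies with very high patent counts can pull a sector's average far above what is typical for that sector.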
Our group’s topic of interest is the relationship between corporate R&D investment intensity and AI innovation efficiency among the world’s top R&D-performing firms. Specifically, we examine whether companies that invest a higher proportion of their revenue in research and development generate AI patents more efficiently relative to their total R&D spending.
Our lab section meets Wednesdays 10:30 AM - 11:20 AM in MGH284, and our TA is Runa He.
In this project, we used several tools beyond those directly covered in lecture.